library(dplyr)
library(ggplot2)
library(inlabru)
library(INLA)
library(terra)
library(sf)
library(scico)
library(magrittr)
library(patchwork)
library(tidyterra)
# We want to obtain CPO data from the estimations
bru_options_set(control.compute = list(dic = TRUE,
waic = TRUE,
mlik = TRUE,
cpo = TRUE))Practical 8 - Zero Inflated
Aim of this practical:
In this practical we are going to look zero-inflated models.
ZIP/ZAP and Hurdle Models
In this practical we are going to work with data with excess zeros. We will
Fit a zero inflated model
Fit a hurdle model
Fit a hurdle model using two likelihoods and a shared component
Data Preparation
The following example use the gorillas dataset available in the inlabru library.
The data give the locations of Gorilla’s nests in an area:
gorillas_sf <- inlabru::gorillas_sf
nests <- gorillas_sf$nests
boundary <- gorillas_sf$boundary
ggplot() + geom_sf(data = nests) +
geom_sf(data = boundary, alpha = 0)The dataset also contains covariates in the form or raster data. We consider two of them here:
gcov = gorillas_sf_gcov()
elev_cov <- gcov$elevation
dist_cov <- gcov$waterdistNote: the covariates have been expanded to cover all the nodes in the mesh.
To obtain the count data, we rasterize the species counts to match the spatial resolution of the covariates available. Then we aggregate the pixels to a rougher resolution (5x5 pixels in the original covariate raster dimensions). Finally, we mask regions outside the study area.
In addition we compute the area of each grid cell.
# Rasterize data
counts_rstr <-
terra::rasterize(vect(nests), gcov, fun = sum, background = 0) %>%
terra::aggregate(fact = 5, fun = sum) %>%
mask(vect(sf::st_geometry(boundary)))
plot(counts_rstr)# compute cell area
counts_rstr <- counts_rstr %>%
cellSize(unit = "km") %>%
c(counts_rstr)To create our dataset of counts, we extract also the coordinate of center point of each raster pixel. In addition we create a column with presences and one with the pixel area
counts_df <- crds(counts_rstr, df = TRUE, na.rm = TRUE) %>%
bind_cols(values(counts_rstr, mat = TRUE, na.rm = TRUE)) %>%
rename(count = sum) %>%
mutate(present = (count > 0) * 1L) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(nests))We then aggregate the covariates to the same resolution as the nest counts and scale them.
elev_cov1 <- elev_cov %>%
terra::aggregate(fact = 5, fun = mean) %>% scale()
dist_cov1 <- dist_cov %>%
terra::aggregate(fact = 5, fun = mean) %>% scale()Mesh building
We now define the mesh and the spde object.
mesh <- fm_mesh_2d(
loc = st_as_sfc(counts_df),
max.edge = c(0.5, 1),
crs = st_crs(counts_df)
)
matern <- inla.spde2.pcmatern(mesh,
prior.sigma = c(1, 0.01),
prior.range = c(5, 0.01)
)In our dataset, the number of zeros is quite substantial, and our model may struggle to account for them adequately. To address this, we should select a model capable of handling an “inflated” number of zeros, exceeding what a standard Poisson model would imply. For this purpose, we opt for a “zero-inflated Poisson model,” commonly abbreviated as ZIP.
Zero-Inflated model (Type1)
We fit now a Zero-Inflated model to our data.
The Type 1 Zero-inflated Poisson model is defined as follows:
\[ \text{Prob}(y\vert\dots)=\pi\times 1_{y=0}+(1-\pi)\times \text{Poisson}(y) \]
Here, \(\pi=\text{logit}^{-1}(\theta)\)
The expected value and variance for the counts are calculated as:
\[ \begin{gathered} E(count)=(1-\pi)\lambda \\ Var(count)= (1-\pi)(\lambda+\pi \lambda^2) \end{gathered} \tag{1}\]
This model has two parameters:
- The probability of excess zero \(\pi\) - This is a hyperparameter and therefore it is constant
- The mean of the Poisson distribution \(\lambda\). This is linked to the linear predictor as: \[ \eta = E\log(\lambda) = \log(E) + \beta_0 + \beta_1\text{Elevation} + \beta_2\text{Distance } + u \] where \(\log(E)\) is an offset (the area of the pixel) that accounts for the size of the cell.
Once the model is fitted we can look at the results
Hurdle model (Type0)
We now fit a hurdle model to the same data.
In the zeroinflatedpoisson0 model is defined by the following observation probability model
\[ \text{Prob}(y\vert\dots)=\pi\times 1_{y=0}+(1-\pi)\times \text{Poisson}(y\vert y>0) \]
where \(\pi\) is the probability of zero.
The expectation and variance of the counts are as follows:
\[ \begin{aligned} E(\text{count})&=\frac{1}{1-\exp(-\lambda)}\pi\lambda \\ Var(\text{count})&= E(\text{count}) \left(1-\exp(-\lambda) E(\text{count})\right) \end{aligned} \tag{2}\]
Hurdle model using two likelihoods
Here the model is the same as in Section 1.3, but this time we also want to model \(\pi\) using covariates and random effects. Therefore we define a second linear predictor \[ \eta^2 =\beta_0^2 + \beta_1^2\text{Elevation} + \beta_2^2\text{Distance} + u^2 \] Note here we have defined the two linear predictor to use the same covariates, but this is not necessary, they can be totally independent.
To fit this model we have to define two likelihoods: - One will account for the presence-absence process and has a Binomial model - One will account for the counts and has a truncated Poisson model
Comparing models
We have fitted four different models. Now we want to compare them and see how they fit the data.
Comparing model predictions
We first want to compare the estimated surfaces of expected counts. To do this we want to produce the estimated expected counts, similar to what we did in Section 1.2 and Section 1.3 for all four models and plot them together:
pred_zip <- predict(
fit_zip,
counts_df,
~ {
pi <- zero_probability_parameter_for_zero_inflated_poisson_1
lambda <- area * exp( distance + elevation + space + Intercept)
expect <- (1-pi) * lambda
variance <- (1-pi) * (lambda + pi * lambda^2)
list(
expect = expect
)
},n.samples = 2500)
pred_zap <- predict( fit_zap, counts_df,
~ {
pi <- zero_probability_parameter_for_zero_inflated_poisson_0
lambda <- area * exp( distance + elevation + space + Intercept)
expect <- ((1-exp(-lambda))^(-1) * pi * lambda)
list(
expect = expect)
},n.samples = 2500)
inv.logit = function(x) (exp(x)/(1+exp(x)))
pred_zap2 <- predict( fit_zap2, counts_df,
~ {
pi <- inv.logit(Intercept_presence + elev_presence + dist_presence + space_presence)
lambda <- area * exp( dist_count + elev_count + space_count + Intercept_count)
expect <- ((1-exp(-lambda))^(-1) * pi * lambda)
list(
expect = expect)
},n.samples = 2500)
pred_zap3 <- predict( fit_zap3, counts_df,
~ {
pi <- inv.logit(Intercept_presence + elev_presence + dist_presence + space_copy)
lambda <- area * exp( dist_count + elev_count + space + Intercept_count)
expect <- ((1-exp(-lambda))^(-1) * pi * lambda)
list(
expect = expect)
},n.samples = 2500)
p = data.frame(x = st_coordinates(counts_df)[,1],
y = st_coordinates(counts_df)[,2],
zip = pred_zip$expect$mean,
hurdle = pred_zap$expect$mean,
hurdle2 = pred_zap2$expect$mean,
hurdle3 = pred_zap3$expect$mean) %>%
pivot_longer(-c(x,y)) %>%
ggplot() + geom_tile(aes(x,y, fill = value)) + facet_wrap(.~name) +
theme_map + scale_fill_scico(direction = -1)Using scores
We can compare model using the scores that the bru() function computes since we have set, at the beginning. the options to
bru_options_set(control.compute = list(dic = TRUE,
waic = TRUE,
mlik = TRUE,
cpo = TRUE))Lets use these scores to compare the models.
From the table above we can see that the model that best balances complexity and fit is the zero inflated one (ZIP).